LIMBO: A Scalable Algorithm to Cluster Categorical Data
Abstract
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. In this work, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. We use the IB framework to define a distance measure for categorical tuples, and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and attribute values. LIMBO handles large data sets by producing a summary model for the data. We propose two versions of LIMBO, which control either the size or the accuracy of the model. We present an experimental evaluation of both versions, and we study how clustering quality in information-theoretic clustering algorithms compares to that of other categorical clustering algorithms. LIMBO also supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with a negligible decrease in quality. LIMBO is a hierarchical algorithm that produces clusterings for a range of k values (where k is the number of clusters). We take advantage of this feature to examine heuristics for selecting good clusterings (with natural values of k) within this range.
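The abstract does not spell out the IB-based distance. As a rough illustration, the sketch below uses the merge cost commonly given in the agglomerative IB literature: the information lost by merging two clusters is their combined probability times the weighted Jensen-Shannon divergence of their conditional distributions over attribute values. The function names, the encoding of a tuple as (attribute index, value) pairs, and the uniform per-tuple distribution are assumptions made for this example, not details taken from the abstract.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(w1, p1, w2, p2):
    """Information loss (in bits) incurred by merging two clusters.

    w1, w2: cluster probabilities p(c1), p(c2).
    p1, p2: conditional distributions p(A | c1), p(A | c2), given as
            equal-length lists over the same attribute-value vocabulary.
    """
    w = w1 + w2
    pi1, pi2 = w1 / w, w2 / w
    # Distribution of the merged cluster.
    mixture = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    # Weighted Jensen-Shannon divergence between the two clusters.
    js = pi1 * kl_divergence(p1, mixture) + pi2 * kl_divergence(p2, mixture)
    return w * js


def tuple_distribution(row, vocabulary):
    """Assumed encoding: uniform mass on the (attribute index, value) pairs of a tuple."""
    present = set(enumerate(row))
    return [1.0 / len(present) if v in present else 0.0 for v in vocabulary]


# Hypothetical toy data: three tuples over two categorical attributes.
rows = [("red", "small"), ("red", "large"), ("blue", "large")]
vocabulary = sorted({pair for row in rows for pair in enumerate(row)})
dists = [tuple_distribution(row, vocabulary) for row in rows]
weight = 1.0 / len(rows)  # each tuple starts as a singleton cluster

cost_01 = merge_cost(weight, dists[0], weight, dists[1])  # share "red"
cost_02 = merge_cost(weight, dists[0], weight, dists[2])  # share nothing
```

Under these assumptions, tuples that share attribute values are cheaper to merge than tuples that share none, which is the property a hierarchical algorithm would exploit when choosing which pair of clusters to merge next.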
Similar resources
LIMBO: Scalable Clustering of Categorical Data
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant ...
Scalable Clustering of Categorical Data and Applications
Periklis Andritsos, Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto, 2004. Clustering is widely used to explore and understand large collections of data. In this thesis, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm based on the Information Bottleneck (IB) framework for quanti...
Evaluating Value Weighting Schemes in the Clustering of Categorical Data
The majority of the algorithms in the clustering literature utilize data sets with numerical values. Recently, new and scalable algorithms have been proposed to cluster data sets with categorical data, data whose inherent ordering is not obvious. However, these algorithms deem all data values present in the data sets as equally important. Thus, the resulting clusters may be influenced by values...
Using Categorical Clustering in Schema Discovery
Most techniques for managing relational schemas assume a given schema that adequately models the data [1]. However, we know that in practice, the semantics of the data may evolve over time and its schema (its table structures and constraints) is not always updated to reflect these changes [5]. Common examples include the overloading of tables to store facts of different types (for example, an o...
Scalable Hierarchical Clustering Method for Sequences of Categorical Values
Data clustering methods have many applications in the area of data mining. Traditional clustering algorithms deal with quantitative or categorical data points. However, there exist many important databases that store categorical data sequences, where significant knowledge is hidden behind sequential dependencies between the data. In this paper we introduce a problem of clustering categorical da...